Entities as topic labels: Improving topic interpretability and evaluability combining Entity Linking and Labeled LDA

Author: Ruiz Fabo Pablo
Nanni Federico
Publication date: 01/01/2016

In order to create a corpus exploration method providing topics that are easier to interpret than standard LDA topic models, here we propose combining two techniques called Entity Linking and Labeled LDA. Our method identifies in an ontology a series of descriptive labels for each document in a corpus. Then it generates a specific topic for each label. Having a direct relation between topics and labels makes interpretation easier; using an ontology as background knowledge limits label ambiguity. As our topics are described with a limited number of clear-cut labels, they promote interpretability, and this may help quantitative evaluation. We illustrate the potential of the approach by applying it in order to define the most relevant topics addressed by each party in the European Parliament's fifth mandate (1999-2004). Comment: in Proceedings of Digital Humanities 2016, Kraków

arXiv.org e-Print Archive

MAnnheim DOCument Server

Automatic Enjambment Detection as a New Source of Evidence in Spanish Versification

Author: Clara Martínez Cantón
Pablo Ruiz Fabo

Plotting Poetry / Machiner la poésie, Basel 2017

ZENODO

FigShare

Closing Remarks: What Was This All About?

Author: Bories Anne-Sophie
Ruiz Fabo Pablo
Plecháč Petr
Publication venue: Walter de Gruyter GmbH
Publication date: 01/01/2022

Centered on the main question of poetics and poeticity, this volume provides a broad overview of computational methods including motif analysis, network analysis, machine learning, and natural language processing. Without limiting ourselves to poetry, we explore the poetics of various literary productions in verse or in prose, as well as experiments towards the computational generation of poems. The volume is meant to gather a representative set of such approaches, and to offer a space for sharing perspectives, practices, and inspiring insights into the issues, old and new, being addressed by digital literary studies.

edoc

Navegación de corpus a través de anotaciones lingüísticas automáticas obtenidas por Procesamiento del Lenguaje Natural: de anecdótico a ecdótico

Author: Bermúdez Sabel Helena
Ruiz Fabo Pablo
Publication venue: UNED - Universidad Nacional de Educación a Distancia
Publication date: 01/11/2019

The article presents two case studies on the application of Natural Language Processing (NLP) technologies or Computational Linguistics to create corpus navigation interfaces. These interfaces help access relevant information for specific research questions in Social Sciences or Humanities. The paper also focuses on how these technologies for automatic text analysis can allow us to enrich scholarly digital editions. The theoretical framework connecting these technologies with scholarly digital editing is described, and the paper reflects on how such technologies and interfaces can be applied to it.

Serveur académique lausannois

HAL Descartes

Revista de Humanidades Digitales

REVISTAS CIENTÍFICAS UNED. Servicio de Publicación y Difusión Digital. Biblioteca UNED

Plotting Poetry: On mechanically enhanced reading, 5–7 October 2017, Basel, Switzerland

Author: Martínez Cantón Clara
Plecháč Petr
Ruiz Fabo Pablo
Seláf Levente
Publication venue: University of Tartu
Publication date: 31/12/2017

Plotting Poetry: On mechanically enhanced reading, 5–7 October 2017, Basel, Switzerland

Journals from University of Tartu

Entities as topic labels : combining entity linking and labeled LDA to improve topic interpretability and evaluability

Author: Lauscher Anne
Nanni Federico
Ponzetto Simone Paolo
Ruiz Fabo Pablo
Publication venue: Accademia University Press
Publication date: 01/01/2016

Digital humanities scholars strongly need a corpus exploration method that provides topics that are easier to interpret than standard LDA topic models. To move towards this goal, here we propose a combination of two techniques, called Entity Linking and Labeled LDA. Our method identifies in an ontology a series of descriptive labels for each document in a corpus. Then it generates a specific topic for each label. Having a direct relation between topics and labels makes interpretation easier; using an ontology as background knowledge limits label ambiguity. As our topics are described with a limited number of clear-cut labels, they promote interpretability and support the quantitative evaluation of the obtained results. We illustrate the potential of the approach by applying it to three datasets, namely the transcription of speeches from the European Parliament's fifth mandate, the Enron Corpus and the Hillary Clinton Email Dataset. While some of these resources have already been adopted by the natural language processing community, they still hold a large potential for humanities scholars, part of which could be exploited in studies that will adopt the fine-grained exploration method presented in this paper.

Universität Mannheim: MADATA - Mannheim Research Data Repository

MAnnheim DOCument Server

Lexical Normalization of Spanish Tweets with Rule-Based Components and Language Models

Author: Cuadros M. (Montse)
Etchegoyhen T. (Thierry)
Ruiz Fabo P. (Pablo)
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 30/03/2014

This paper presents a system to normalize Spanish tweets, which uses preprocessing rules, a domain-appropriate edit-distance model, and language models to select correction candidates based on context. The system is an improvement on the tool we submitted to the Tweet-Norm 2013 shared task, and results on the task's test corpus are above average. Additionally, we provide a study of the impact on tweet normalization of the different components of the system: rule-based, edit-distance-based and statistical.

univOAK

Retos, perspectivas y próximos pasos dentro de POSTDATA

Author: Helena Bermúdez Sabel
Pablo Ruiz Fabo

Presentation at HDH2017 in Málaga

ZENODO

FigShare

Opportunités et limites en stylistique computationnelle de la poésie : Détection automatique de l'enjambement en anglais

Author: Monget E. (Eulalie)
Ruiz Fabo P. (Pablo)
Publication date: 14/05/2020

Several international initiatives attest to the current interest in computer-assisted literary analysis, such as the Digital Literary Stylistics special interest group (SIG-DLS) of the Alliance of Digital Humanities Organizations. The Plotting Poetry conference shows the variety of poetic phenomena being addressed with computational tools (https://plottingpoetry.wordpress.com/). A thematic issue of the journal Langages (2015) provides another overview. Projects that involve automatic linguistic annotation for literary analysis share certain concerns: how can concepts from literary analysis be operationalized on the basis of annotations produced by Natural Language Processing (NLP) tools, which were originally designed for non-literary analysis? How should we evaluate our automatic annotations of a stylistic feature, in terms of, and beyond, comparison with manually annotated reference data? What gain in knowledge specific to literary analysis can be achieved through automatic stylistic annotation that would be impossible without computational processing? These questions also concern us in our project on the automatic detection of enjambment in English poetry. Enjambment involves a mismatch between the pauses required by metrical structure (line ends or hemistichs) and the pauses called for by syntax or meaning (cf. Golomb, 1979, p. 269). It often occurs when a phrase is split across two successive lines, frustrating the expectation of a pause at the end of the first line. Beyond this general characterization, there is no consensus on the definition of enjambment (cf. Quilis, 1964; Hollander, 1975; Golomb, 1979; Hussein et al., 2018; Delente, 2019).
This is one reason to develop software that implements the different possible definitions: by automatically detecting their occurrences in a large corpus, the strengths and limits of each definition can be examined against a large set of examples. Moreover, there are no published studies on automatic enjambment detection in English, unlike German (Hussein et al., 2018) or Spanish (Ruiz et al., 2017, http://prf1.org/anja/index/). Regarding operationalization, we adopt a largely syntax-based definition (Quilis, 1964): enjambment occurs when the line end splits certain sequences with strong internal cohesion. This choice has two advantages. First, it is easy to operationalize: the definition involves sequences of part-of-speech tags, syntactic dependencies and constituents, all of which NLP libraries provide. Second, it lets us check whether this approach, already applied to Spanish (Ruiz et al., 2017), carries over to English. We did find limits, and modified the typology to handle English better (see https://git.unistra.fr/enj/corpus-reference); more generally, Delente (2019) discusses the limits of syntactic definitions. The quality of NLP output decreases on literary texts (Bamman, 2017). However, neural models have recently brought quality gains in many NLP tasks, and we exploit them through the spaCy (Honnibal and Montani, 2017) and AllenNLP (Gardner et al., 2017) libraries. Our study is an opportunity to test their robustness on a demanding corpus. For evaluation, we manually annotated enjambment, according to our definition, in 60 poems of varied genres from the 19th and 20th centuries (see https://git.unistra.fr/enj/corpus-reference). The corpus will be used to compare automatic detection with human annotation.
Beyond this, we would like to automatically annotate a large diachronic corpus to detect possible trends in the distribution of enjambment. We would like to discuss with the community the topics this project allows us to examine: the adoption and adaptation of language technologies to operationalize literary concepts, evaluation problems in automatic stylistic annotation, and the potential and limits of these approaches for contributing new knowledge in literary studies.

univOAK

Entities as Topic Labels: Combining Entity Linking and Labeled LDA to Improve Topic Interpretability and Evaluability

Author: Lauscher Anne
Nanni Federico
Ponzetto Simone Paolo
Ruiz Fabo Pablo
Publication venue: OpenEdition
Publication date: 15/12/2020

OpenEdition